Structured classification for multilingual natural language processing

Author

  • Phil Blunsom
Abstract

This thesis investigates the application of structured sequence classification models to multilingual natural language processing (NLP). Many tasks tackled by NLP can be framed as classification, where we seek to assign a label to a particular piece of text, be it a word, sentence or document. Yet often the labels which we’d like to assign exhibit complex internal structure, such as labelling a sentence with its parse tree, and there may be an exponential number of them to choose from. Structured classification seeks to exploit the structure of the labels in order to allow both generalisation across labels which differ by only a small amount, and tractable searches over all possible labels.

In this thesis we focus on the application of conditional random field (CRF) models (Lafferty et al., 2001). These models assign an undirected graphical structure to the labels of the classification task and leverage dynamic programming algorithms to efficiently identify the optimal label for a given input. We develop a range of models for two multilingual NLP applications: word-alignment for statistical machine translation (SMT), and multilingual supertagging for highly lexicalised grammars.

The first half of this thesis is dedicated to the task of word-alignment for SMT, which aims to find a mapping from words in a source language sentence to words in a target translation sentence. We treat this problem as one of structured classification with a CRF, where the input is the parallel sentence pair and the output is the index for each word in the source sentence of its aligned translation (or null) in the target sentence. By exploiting the ability of the CRF model to incorporate a diverse range of features we are able to explore many binary and real-valued features. Orthographic and syntactic features are defined which aim to generalise sparse word-to-word translation features.
In addition, we define powerful features from unsupervised generative models, and collocation statistics derived
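
As an illustration of the dynamic programming the abstract refers to, the following is a minimal sketch of Viterbi decoding for a linear-chain model, written in Python. It is not taken from the thesis: the function name `viterbi`, the score tables `emit` and `trans`, and the toy numbers are all hypothetical. In the word-alignment framing, position t would be a source word and label y the index of its aligned target word, with 0 standing for null.

```python
from typing import List, Sequence


def viterbi(emit: Sequence[Sequence[float]],
            trans: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emit[t][y]  : log-score of label y at position t (hypothetical scores)
    trans[p][y] : log-score of label y following label p (hypothetical scores)
    """
    n_labels = len(trans)
    best = list(emit[0])          # best score of any path ending in each label
    back: List[List[int]] = []    # back-pointers, one list per position t >= 1

    for t in range(1, len(emit)):
        prev = best
        best, ptr = [], []
        for y in range(n_labels):
            # Best previous label for reaching label y at position t.
            p = max(range(n_labels), key=lambda q: prev[q] + trans[q][y])
            best.append(prev[p] + trans[p][y] + emit[t][y])
            ptr.append(p)
        back.append(ptr)

    # Follow the back-pointers from the best final label.
    y = max(range(n_labels), key=lambda q: best[q])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return path


# Toy example: three source words, labels 0 (null), 1 and 2 (target positions).
emit_scores = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
no_transition_preference = [[0.0] * 3 for _ in range(3)]
alignment = viterbi(emit_scores, no_transition_preference)
```

The search costs time proportional to the sentence length times the square of the number of labels, rather than enumerating every possible label sequence, which is the tractability the abstract describes.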

Similar resources

Identification of Disease Symptoms in Multilingual Sentences: An Ontology-Driven Approach

In this paper we present a Multilingual Ontology-Driven framework for Text Classification (MOoD-TC). This framework is highly modular and can be customized to create applications based on Multilingual Natural Language Processing for classifying domain-dependent contents. In order to show the potential of MOoD-TC, we present a case study in the e-Health domain.

NLGbAse: A Free Linguistic Resource for Natural Language Processing Systems

Availability of labeled language resources, such as annotated corpora and domain dependent labeled language resources is crucial for experiments in the field of Natural Language Processing. Most often, due to lack of resources, manual verification and annotation of electronic text material is a prerequisite for the development of NLP tools. In the context of under-resourced language, the lack o...

Dbnary: Wiktionary as a Lemon Based RDF Multilingual Lexical Resource

Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia foundation. In this article, we present our effort to extract multilingual lexical data from Wiktionary data and to provide it to the community as...

Automatic Multilingual Indexing and Natural Language Processing

The number of documents being collected by information brokers such as bibliographic database producers, libraries and publishers increases rapidly. The consequence is a huge demand for indexing and classification. So far this has had to be carried out manually. The system AUTINDEX, which is described in this paper offers tools for monolingual as well as for multilingual automatic indexing and ...

DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF

Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia foundation. In this article, we present our extraction of multilingual lexical data from Wiktionary data and to provide it to the community as a M...

Publication date: 2007